-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: change BlockManager pickle format to work with dup items #7370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: change BlockManager pickle format to work with dup items #7370
Conversation
this is breaking compat by making a new format which is incompatible with older versions we don't change this format lightly (maybe should be changed but 0.15 would be better) so see if can fix w/o changing the format now see tests in io/tests/test_pickle.py need to add duplicate items to the generation script and make test pickles for 0.14 that generate this error |
Well, this is as simple as it could get, forward compatibility is tricky, especially if not kept in mind when a protocol is designed. I suppose, if we dropped the "beta" protocol mentioned in the inline comment, we could store the dictionary as stated above in the 4-th element of the state tuple which, as of 0.14.0, is silently skipped with the first three being populated as before. Pickle is smart enough not to serialize same object multiple times, so it's only the metadata that will get duplicated... |
what I mean is don't change the protocol at all |
Ah, I've lost the first paragraph of previous comment somewhere. What it said is that I don't see how equal items can be told from one another unless we change the protocol. So, I guess, this PR should wait till 0.15 then. |
why would you not catch the reindexing error and then just do this?
I agree prob should change the format to be better as well. Pls add additional tests that mimic the user behavior (in io/pickle). |
Hmm, this would only work for block managers with one blocks, right? Writing duplicate-aware format into the extra tuple element and adding a couple of other tricks made pickling forward compatible with 0.12.0 which is the oldest version I have at hand. Which makes me wonder if there any guidelines on io deprecation, i.e. when ensuring forward compatibility which version should be the oldest to be supported? |
if you can forward compat to 0.12 that's excellent (and good enough). the 0.12-0.13.0 pickle transition was difficult because of the subclassing, but the format didn't change per se (though was extended a bit for sparse types, which already pickle with a dict rather than tuples). |
Ok, I'll add more tests and make sure sparse items are handled, too. |
@immerrr how's this coming? |
I got myself too deep into how do I implement consistent forward-compatibility checks. Doing this once is easy, ofc, but I don't see an easy way to check that short of reinventing something vbench-ish, as in write a pickle with target revision, check out baseline revision, build it, unpickle the file and compare to etalon, and so on... |
you can generate pickles for a version here: https://github.com/pydata/pandas/blob/master/pandas/io/tests/generate_legacy_pickles.py |
This is good for backward compatibility, yes, but how do I check that 0.13 haven't ceased to read current pickles? |
well one time test is to do it in a virtual env |
I did some testing, 0.13.1 works fine (doesn't croak, at least), as does 0.14.0. Alas, 0.12.0 fails to read Series (due to SingleBlockManager class). Is there a way around that? Also, I cannot help with many of the platforms I find in |
I don't think forward compat with 0.12.0 is a big deal, IOW I don't expect 0.12.0 pandas to be able to read pickles generated in 0.13 + at all (its really just backward compat that we care about), though forward compat to 0.13.0 is nice too. |
Ok, what about legacy pickles I can't easily create? |
don't need to create them, they already exist. What are you trying to do? |
Well the format is about to change slightly so I figured maybe it's worth double checking it won't break in the future. I've also bumped the original message adding a rationale for the chosen format extension. If you're ok with both the rationale and the implementation I.'ll put up a release note tomorrow. On Thu, Jun 19, 2014 at 12:41 AM, jreback [email protected]
|
# discard anything after 3rd, support beta pickling format for a little | ||
# while longer | ||
ax_arrays, bvalues, bitems = state[:3] | ||
extra_state = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why don't you add a version tag instead, this seems kind of odd to do it this way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's an "upcoming" version tag, as in the first stable version having this serialization format, serving as a key into this dictionary. Or do you mean something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no I meant having the version as the key was odd, why not just as a key-value in the dict, e.g. version : '0.14.1'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I figured, something like this:
for ver in list(supported_versions):
if ver in state:
setstate_ver(state[ver])
would be easier on the eye than:
for ver in list(supported_versions):
for d in state.values():
if d['version'] == ver:
setstate(d)
break
But it's not a strong opinion, rather a gut feeling. If you insist, I'll make the version a dictionary element again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, either way is fine. Just trying to make it easy on future versions.
I think the format should be a dict. When I meant forwards compat I a) meant 'easy' to read / modify w/o breaking in the future, and backwards if possible by prior versions (e.g. >= 0.13.0), not sure if that is possible even. Or maybe even not break it unless it is really necessary (e.g. no other way to do it). |
0.13.0+ are able to read this "extended" format, but there is a difference which might need to be guarded against future breakage, which AFAIU is done via the legacy pickle tests. Or do I get that wrong? |
the extended tests are really just to maintain back compat (unfortunately they don't cover EVERYTHING). I don't think their are duplicated indexers there. That said could generate some new pickles for particular versions (e.g. just run a 'new' version of the script in a virtual env). Its robust to adding things (just skips) |
Ok, I'm confused. Do you want me to add something to legacy pickle verification infrastructure or not? |
- Support pickling ``Series``, ``DataFrame`` and ``Panel`` objects with | ||
non-unique labels along *item* axis (``index``, ``columns`` and ``items`` | ||
respectively). | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the issue reference here
minor comment I think I'd like you to modify the legacy_pickle script. to add some examples of duplicated frames. then run and generate it for 0.13 (and add the generated pickle to the repo, on your platform is fine). It should then break master (w/o this fix), but work with it. |
I don't think it'll work with this fix for a pickle of dup-item object generated in 0.13 code, there will just be no placement information to use. |
ok, should change the script going forward (egenerated pick) in any event; so when we make the pickles for 0.14.1 it will preserve pickle changes. |
@immerrr ok....pls rebase.... |
@jreback, fixed, rebased & green |
ENH: change BlockManager pickle format to work with dup items
thanks |
Old
BlockManager
pickle format stored items which were ambiguous and not enough for unpickling if non-unique.After recent BlockManager overhaul, #6745, it's now possible to it's no longer necessary to share items/ref-items between Blocks and their respective managers
, so Blocks can now be safely pickled/unpickled on their own, but Blocks still cannot be safely pickled/unpickled on their own, because they a) are part of internal API and are subject to change and b) were never intended to be serialized like that and their setstate is written in non-forward-compatible manner.So, I've took the freedom to make the format more "extendable" by storing the state as the fourth element in BlockManager state tuple:
I was aiming for the following design goals:
This PR should close #7329.
ensure forward compatibility with 12.0 ?not necessary